Versatile design of shared vector coprocessors for multicores

Authors

Spiridon F. Beldianu

Christopher Dahlberg

Timothy Steele

Sotirios G. Ziavras

Abstract

For most applications that use a vector coprocessor, its resources are not highly utilized because data parallelism is not sustained, which often results from insufficient vector parallelism or from vector-length variations in dynamic environments. The motivation for our work stems from (a) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (b) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (c) the frequent need to handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. Our objective is to provide a versatile design framework that facilitates vector coprocessor sharing among multiple cores in a manner that maximizes resource utilization while also yielding very high performance at reduced area and energy costs. We have previously proposed three basic shared vector coprocessor architectures, based on coarse-grain temporal, fine-grain temporal, and vector lane sharing, that were implemented in SystemVerilog [15]. Our new paper presents substantially improved versions of these architectures that are implemented in synthesized RTL for higher accuracy. We herein evaluate these vector coprocessor sharing policies for a dual-core system using floating-point performance, resource utilization, and power consumption as metrics. Benchmarking for FIR filtering, FFT, matrix multiplication, LU decomposition, and sparse matrix-vector multiplication shows that these coprocessor sharing policies yield high utilization, high performance, and low energy per operation. Fine-grain temporal sharing most often provides the best performance among the three policies; it is followed by vector lane sharing and then coarse-grain temporal sharing. It is also shown that per-core exclusive access to the vector resources does not maximize their utilization. The benchmarking involves various scenarios for each application, where the scenarios differ in terms of the vector length and the parallelism-oriented coding technique.

Keywords: Vector coprocessor, coprocessor sharing, multicore, FPGA prototyping.

Similar resources

Understanding Concurrency for Graph Workloads in Large Scale Multicores

Algorithms operating on a graph setting are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenge when these algorithms are parallelized and executed on the evolving multicore processors. Previous parallel benchmark suites for shared memory multicores have focused on various workload domains, such as scientific, graphics, and vision. However...

Towards Power-Aware Data Pipelining on Multicores

Power consumption management has become a major concern in software development. Continuous streaming computations are usually composed by different modules, exchanging data through shared message queues. The selection of the algorithm used to access such queues (i.e., the concurrency control) is a critical aspect for both performance and power consumption. In this paper, we describe the design...

An Analysis of an Interrupt-Driven Implementation of the Master-Worker Model with Application-Specific Coprocessors

In this thesis, we present a versatile parallel programming model composed of an individual general-purpose processor aided by several application-specific coprocessors. These computing units operate under a simplification of the master-worker model. The user-defined coprocessors may be either homogeneous or heterogeneous. We analyze system performance with regard to system size and task granul...

Or-Parallel Prolog Execution on Clusters of Multicores

Logic Programming languages, such as Prolog, provide an excellent framework for the parallel execution of logic programs. In particular, the inherent non-determinism in the way logic programs are structured makes Prolog very attractive for the exploitation of implicit parallelism. One of the most noticeable sources of implicit parallelism in Prolog programs is or-parallelism. Or-parallelism ari...

A Simple Statistical Cache Sharing Model for Multicores

The introduction of multicores has made analysis of shared resources, such as shared caches and shared DRAM bandwidth, an important topic to study. We present two simple, but accurate, cache sharing models that use high-level data that can easily be measured on existing systems. We evaluate our model using a simulated multicore processor with four cores and a shared L2 cache. Our evaluation sho...

Journal:

Microprocessors and Microsystems - Embedded Hardware Design

Volume 36

Publication date: 2012
